Token Reduction


Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Yadavalli, Bharadwaj

arXiv.org Artificial Intelligence

Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens--the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compare two routing approaches: a simple MLP operating on pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa's performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection--routing decisions generalize effectively across three major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work contributes several key elements: a formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.
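
As a rough illustration of the routing idea, the sketch below implements a small MLP classifier over pre-computed query embeddings that picks one of a few response templates. The template names, embedding dimension, and untrained weights are assumptions for the sketch; the paper's actual router, features, and template taxonomy may differ.

```python
# Minimal sketch of an embedding-based template router; NOT the authors' implementation.
# Assumed: a fixed-size query embedding (from any sentence-embedding model) and three
# hypothetical template classes. In practice the weights would be trained on labeled data.
import numpy as np

TEMPLATES = ["concise", "standard", "detailed"]  # hypothetical complexity classes

class MLPRouter:
    def __init__(self, dim: int, hidden: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, (hidden, len(TEMPLATES)))
        self.b2 = np.zeros(len(TEMPLATES))

    def route(self, embedding: np.ndarray) -> str:
        h = np.maximum(embedding @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        logits = h @ self.w2 + self.b2
        return TEMPLATES[int(np.argmax(logits))]             # template to attach to the prompt

router = MLPRouter(dim=384)
query_embedding = np.random.default_rng(1).normal(size=384)  # stand-in for a real embedding
print(router.route(query_embedding))
```

The appeal of such a design over a fine-tuned transformer router is that the per-query overhead is one embedding lookup plus a tiny matrix multiply.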


Frequency-Aware Token Reduction for Efficient Vision Transformer

Lee, Dong-Jae, Hur, Jiwan, Choi, Jaehyun, Yu, Jaemyung, Kim, Junmo

arXiv.org Artificial Intelligence

Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity with respect to token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as rank collapsing and the over-smoothing phenomenon. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency and low-frequency tokens: high-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct-current (DC) token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing. Furthermore, we analyze previous methods, shedding light on their implicit frequency characteristics and limitations.
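
One way to picture the high/low-frequency split is sketched below: treat the token mean as the low-frequency (DC) component, score each token by its deviation from that mean, keep the most "high-frequency" tokens, and average the rest into a single DC token. This is a simplified reading of the idea, not the paper's implementation; the actual partitioning criterion may differ.

```python
# Simplified sketch of frequency-aware token reduction (illustration only, not the paper's code).
# The token mean stands in for the low-frequency (DC) component; deviation from it stands in
# for high-frequency content.
import torch

def frequency_aware_reduce(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """x: (batch, tokens, dim) -> (batch, kept_tokens + 1, dim)."""
    b, n, d = x.shape
    dc = x.mean(dim=1, keepdim=True)                      # low-frequency (DC) component
    hf_score = (x - dc).norm(dim=-1)                      # deviation from DC per token
    k = max(1, int(n * keep_ratio))
    idx = hf_score.topk(k, dim=1).indices                 # high-frequency tokens to keep
    keep = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool, device=x.device)
    mask.scatter_(1, idx, False)                          # mark remaining low-frequency tokens
    counts = mask.sum(dim=1, keepdim=True).clamp(min=1).unsqueeze(-1)
    low = (x * mask.unsqueeze(-1)).sum(dim=1, keepdim=True) / counts
    return torch.cat([keep, low], dim=1)                  # kept tokens + one aggregated DC token

tokens = torch.randn(2, 197, 768)                          # e.g., a ViT-B/16 token sequence
print(frequency_aware_reduce(tokens).shape)                # torch.Size([2, 99, 768])
```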


Let LRMs Break Free from Overthinking via Self-Braking Tuning

Zhao, Haoran, Yan, Yuchen, Shen, Yongliang, Xu, Haolei, Zhang, Wenqi, Song, Kaitao, Shao, Jian, Lu, Weiming, Xiao, Jun, Zhuang, Yueting

arXiv.org Artificial Intelligence

Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.
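
As a loose illustration of an "overthinking identification metric based on standard answers", the sketch below measures what fraction of a reasoning trace is generated after the step in which the gold answer first appears. This is a stand-in metric for the idea, not the paper's actual formulation.

```python
# Hedged illustration, not the SBT metric: flag the share of a trace produced after
# the gold answer is first reached, as a simple proxy for redundant reasoning.
def overthinking_ratio(reasoning_steps: list[str], gold_answer: str) -> float:
    total = sum(len(step.split()) for step in reasoning_steps)
    if total == 0:
        return 0.0
    spent = 0
    for step in reasoning_steps:
        spent += len(step.split())
        if gold_answer in step:                  # answer already reached at this step
            return (total - spent) / total       # fraction of the trace spent afterwards
    return 0.0                                   # answer never appears: nothing flagged

steps = [
    "Compute 12 * 7 = 84.",
    "So the answer is 84.",
    "Let me double-check: 12 * 7 is 84, yes.",
    "Therefore the final answer is 84.",
]
print(round(overthinking_ratio(steps, "84"), 2))  # 0.77: most of the trace follows the first correct answer
```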


The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget

Pan, Dangfeng, Sun, Zhensu, Zhang, Cenyuan, Lo, David, Du, Xiaoning

arXiv.org Artificial Intelligence

Source code is usually formatted with elements like indentation and newlines to improve readability for human developers. However, these visual aids do not seem to be beneficial for large language models (LLMs) in the same way since the code is processed as a linear sequence of tokens. Furthermore, these additional tokens can lead to increased computational costs and longer response times for LLMs. If such formatting elements are non-essential to LLMs, we can reduce such costs by removing them from the code. To figure out the role played by formatting elements, we conduct a comprehensive empirical study to evaluate the impact of code formatting on LLM performance and efficiency. Through large-scale experiments on Fill-in-the-Middle Code Completion tasks across four programming languages (Java, Python, C++, C#) and ten LLMs, including both commercial and open-source models, we systematically analyze token count and performance when formatting elements are removed. Key findings indicate that LLMs can maintain performance across formatted code and unformatted code, achieving an average input token reduction of 24.5% with negligible output token reductions. This makes code format removal a practical optimization strategy for improving LLM efficiency. Further exploration reveals that both prompting and fine-tuning LLMs can lead to significant reductions (up to 36.1%) in output code length without compromising correctness. To facilitate practical applications, we develop a bidirectional code transformation tool for format processing, which can be seamlessly integrated into existing LLM inference workflows, ensuring both human readability and LLM efficiency.
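
The sketch below shows the flavor of this optimization for a brace-delimited language: strip indentation and blank lines before sending code to the model, and restore formatting for humans with a separate pretty-printer. It is an illustration under stated assumptions, not the paper's transformation tool, and a real implementation must respect string literals, comments, and indentation-sensitive languages such as Python.

```python
# Illustration only (not the paper's tool): remove formatting tokens from C-style code
# before sending it to an LLM. Character count is used here as a rough proxy for tokens.
import re

def strip_formatting(code: str) -> str:
    lines = [line.strip() for line in code.splitlines()]
    lines = [line for line in lines if line]                 # drop blank lines
    return re.sub(r"\s+", " ", " ".join(lines)).strip()      # collapse whitespace runs

java_snippet = """
public int add(int a, int b) {
    int sum = a + b;
    return sum;
}
"""
compact = strip_formatting(java_snippet)
print(compact)
print(f"characters: {len(java_snippet)} -> {len(compact)}")
```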


TokenLearner: Adaptive Space-Time Tokenization for Videos - Supplementary Materials - Michael S. Ryoo

Neural Information Processing Systems

We train the Kinetics model for 30 epochs with a base learning rate of 0.05 using momentum. Basically, all the settings in our Kinetics experiments follow the setting of ViViT; we provide the training details below. We use the cosine-decay learning rate schedule popularly used in many video CNN model trainings. A base learning rate of 0.8 per TPU core (which is equivalent to a single GPU) is used for the Charades dataset (multi-label action classification), and a base rate of 0.025 per TPU core was used for AViD. Label smoothing of 0.2 was used for the AViD training. In Charades, the training was done by temporally cropping long Charades videos. In all these experiments, the ViT L/16 model was used.
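
For reference, the cosine-decay schedule mentioned above can be written in a few lines. The Kinetics numbers (base learning rate 0.05, 30 epochs) are taken from the text; warmup and per-step scheduling details are omitted here and would be assumptions.

```python
# Sketch of a cosine-decay learning-rate schedule with the quoted Kinetics settings
# (base LR 0.05 over 30 epochs). Warmup and per-step scheduling are not modeled.
import math

def cosine_decay_lr(epoch: float, total_epochs: int = 30, base_lr: float = 0.05) -> float:
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for e in (0, 15, 30):
    print(e, round(cosine_decay_lr(e), 4))   # 0.05 at the start, 0.025 midway, 0.0 at the end
```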


webMCP: Efficient AI-Native Client-Side Interaction for Agent-Ready Web Design

Perera, D.

arXiv.org Artificial Intelligence

Current AI agents create significant barriers for users by requiring extensive processing to understand web pages, making AI-assisted web interaction slow and expensive. This paper introduces webMCP (Web Machine Context & Procedure), a client-side standard that embeds structured interaction metadata directly into web pages, enabling more efficient human-AI collaboration on existing websites. webMCP transforms how AI agents understand web interfaces by providing explicit mappings between page elements and user actions. Instead of processing entire HTML documents, agents can access pre-structured interaction data, dramatically reducing computational overhead while maintaining task accuracy. A comprehensive evaluation across 1,890 real API calls spanning online shopping, authentication, and content management scenarios demonstrates webMCP reduces processing requirements by 67.6% while maintaining 97.9% task success rates compared to 98.8% for traditional approaches. Users experience significantly lower costs (34-63% reduction) and faster response times across diverse web interactions. Statistical analysis confirms these improvements are highly significant across multiple AI models. An independent WordPress deployment study validates practical applicability, showing consistent improvements across real-world content management workflows. webMCP requires no server-side modifications, making it deployable across millions of existing websites without technical barriers. These results establish webMCP as a viable solution for making AI web assistance more accessible and sustainable, addressing the critical gap between user interaction needs and AI computational requirements in production environments.
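
To make the idea concrete, the sketch below shows an agent consuming a compact, pre-structured action map instead of a full HTML document. The field names and matching logic are invented for illustration; the actual webMCP metadata schema is defined by the standard itself.

```python
# Hypothetical illustration of the webMCP idea (field names invented for this sketch):
# the agent plans against a compact action map rather than parsing the whole page.
page_metadata = {
    "page": "checkout",
    "actions": [
        {"id": "add_to_cart", "selector": "#buy-btn", "method": "click",
         "description": "Add the current item to the shopping cart"},
        {"id": "set_quantity", "selector": "input[name=qty]", "method": "fill",
         "params": {"value": "integer >= 1"},
         "description": "Set how many units to purchase"},
    ],
}

def plan_action(user_goal: str, metadata: dict):
    # Trivial keyword matcher standing in for the LLM call; it only ever sees the
    # short action descriptions, which is where the token savings come from.
    for action in metadata["actions"]:
        if any(word in action["description"].lower() for word in user_goal.lower().split()):
            return action
    return None

print(plan_action("add this item to my cart", page_metadata)["id"])  # add_to_cart
```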


Compressing Chain-of-Thought in LLMs via Step Entropy

Li, Zeju, Zhong, Jianyuan, Zheng, Ziyang, Wen, Xiangyu, Xu, Zhijian, Cheng, Yingying, Zhang, Fan, Xu, Qiang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with minor degradation in final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures. The code and data are released at https://github.com/staymylove/COT
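
A hedged sketch of the scoring-and-pruning idea follows: represent each reasoning step by the model's per-token next-token distributions, score the step by its mean token entropy, and replace the lowest-entropy steps with a [SKIP] marker. The exact entropy definition, thresholds, and step segmentation here are assumptions, not the paper's formulation.

```python
# Illustration of entropy-based step pruning (my reading of the idea, not the paper's code).
# Low-entropy steps are treated as nearly deterministic continuations that add little information.
import numpy as np

def step_entropy(token_dists: np.ndarray) -> float:
    """token_dists: (n_tokens, vocab) rows of next-token probabilities for one step."""
    p = np.clip(token_dists, 1e-12, 1.0)
    token_entropies = -(p * np.log(p)).sum(axis=-1)
    return float(token_entropies.mean())

def prune_low_entropy_steps(steps: list[str], dists: list[np.ndarray], keep_frac: float = 0.2):
    scores = np.array([step_entropy(d) for d in dists])
    k = max(1, int(round(len(steps) * keep_frac)))
    keep_idx = set(np.argsort(scores)[-k:])              # keep the highest-entropy steps
    return [s if i in keep_idx else "[SKIP]" for i, s in enumerate(steps)]

rng = np.random.default_rng(0)
steps = [f"step {i}" for i in range(5)]
# Peaked (low-entropy) vs. flatter (high-entropy) synthetic distributions for demonstration.
dists = [rng.dirichlet(np.full(50, a), size=8) for a in (0.1, 5.0, 0.1, 0.1, 3.0)]
print(prune_low_entropy_steps(steps, dists))             # most steps replaced by [SKIP]
```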


ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference

Zhang, Haoyue, Zhang, Jie, Guo, Song

arXiv.org Artificial Intelligence

Although vision transformers (ViT) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information among blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper, we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reuse at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of the LV-ViT model by 50% with less than a 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity compared to state-of-the-art methods. Large-scale pre-trained vision transformer (ViT) models [37] have achieved remarkable progress in the field of vision tasks.
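
A stripped-down picture of "freeze now, reuse later" is sketched below: tokens are split by an importance score into an active set that continues through the next blocks and a frozen set that is carried along and concatenated back at a later stage. ToFe's scoring and recovery modules are learned; here a placeholder score is used, so this illustrates the data flow only.

```python
# Illustration of freeze-and-reuse data flow (not the ToFe modules): split tokens by an
# importance score, run only the active ones through later blocks, then reattach the rest.
import torch

def freeze_tokens(x: torch.Tensor, scores: torch.Tensor, keep: int):
    """x: (batch, n, dim), scores: (batch, n) -> (active, frozen) token tensors."""
    order = torch.argsort(scores, dim=1, descending=True)
    active_idx, frozen_idx = order[:, :keep], order[:, keep:]
    gather = lambda i: torch.gather(x, 1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    return gather(active_idx), gather(frozen_idx)

x = torch.randn(2, 197, 768)
scores = torch.randn(2, 197)                  # placeholder for a learned prediction module
active, frozen = freeze_tokens(x, scores, keep=98)
# `active` goes through the next transformer blocks; later the frozen tokens are reused:
reused = torch.cat([active, frozen], dim=1)
print(active.shape, frozen.shape, reused.shape)
```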


Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency

Sun, Yizheng, Li, Hao, Xu, Chang, Lin, Chenghua, Batista-Navarro, Riza, Sun, Jingyuan

arXiv.org Artificial Intelligence

Vision language models (VLMs) have excelled in visual reasoning but often incur high computational costs. One key reason is the redundancy of visual tokens. Although recent token reduction methods claim to achieve minimal performance loss, our extensive experiments reveal that token reduction can substantially alter a model's output distribution, leading to changes in prediction patterns that standard metrics such as accuracy loss do not fully capture. Such inconsistencies are especially concerning for practical applications where system stability is critical. To investigate this phenomenon, we analyze how token reduction influences the energy distribution of a VLM's internal representations using a lower-rank approximation via Singular Value Decomposition (SVD). Our results show that changes in the Inverse Participation Ratio of the singular value spectrum are strongly correlated with the model's consistency after token reduction. Based on these insights, we propose LoFi--a training-free visual token reduction method that utilizes the leverage score from SVD for token pruning. Experimental evaluations demonstrate that LoFi not only reduces computational costs with minimal performance degradation but also significantly outperforms state-of-the-art methods in terms of output consistency.
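
In the same spirit, a leverage-score pruning step can be sketched as follows: take an SVD of the visual-token matrix, score each token by the squared norm of its row in the top singular directions, and keep the highest-scoring tokens. The rank, keep budget, and token shapes are assumptions for the sketch; the released LoFi code should be consulted for the actual procedure.

```python
# Training-free sketch in the spirit of leverage-score token pruning (illustration only).
import torch

def leverage_score_prune(tokens: torch.Tensor, keep: int, rank: int = 32) -> torch.Tensor:
    """tokens: (n, dim) visual tokens -> (keep, dim) pruned tokens."""
    u, s, vh = torch.linalg.svd(tokens, full_matrices=False)
    r = min(rank, s.numel())
    scores = (u[:, :r] ** 2).sum(dim=1)        # leverage score of each token
    idx = scores.topk(keep).indices
    return tokens[idx]

visual_tokens = torch.randn(576, 1024)          # e.g., a 24x24 grid of visual tokens
print(leverage_score_prune(visual_tokens, keep=144).shape)   # torch.Size([144, 1024])
```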


Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Aytes, Simon A., Baek, Jinheon, Hwang, Sung Ju

arXiv.org Artificial Intelligence

Recent advances in large language models have demonstrated remarkable reasoning capabilities through Chain of Thought (CoT) prompting, but often at the cost of excessive verbosity in their intermediate outputs, which increases computational overhead. We introduce Sketch-of-Thought (SoT), a novel prompting framework that combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage while preserving reasoning accuracy. SoT is designed as a flexible framework that can incorporate any custom reasoning paradigms based on cognitive science, and we instantiate it with three such paradigms - Conceptual Chaining, Chunked Symbolism, and Expert Lexicons - each tailored to different reasoning tasks and selected dynamically via a lightweight routing model. Through comprehensive evaluation across 15 reasoning datasets with multiple languages and multimodal scenarios, we demonstrate that SoT achieves token reductions of 76% with negligible accuracy impact. In certain domains like mathematical and multi-hop reasoning, it even improves accuracy while using significantly fewer tokens. Our code is publicly available: https://www.github.com/SimonAytes/SoT.
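
The paradigm names below come from the abstract, but the prompt wording and the way a routed paradigm is turned into an instruction are invented for this sketch; they only illustrate how a lightweight router's decision could constrain output length.

```python
# Hypothetical prompt prefixes for the three SoT paradigms named in the abstract
# (wording invented for this sketch, not taken from the paper).
PARADIGM_PROMPTS = {
    "conceptual_chaining": "Answer by chaining the key concepts as short noun phrases only.",
    "chunked_symbolism": "Answer using compact symbolic or mathematical notation, no prose.",
    "expert_lexicons": "Answer in terse domain shorthand an expert would recognize.",
}

def build_sketch_prompt(question: str, paradigm: str) -> str:
    return f"{PARADIGM_PROMPTS[paradigm]}\n\nQuestion: {question}\nSketch:"

print(build_sketch_prompt("What connects rainfall to river flooding?", "conceptual_chaining"))
```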